Introduction¶
21-11-2024¶
This logbook will describe the process of creating visualisations, ideas. These visualisations and ideas will be used to create an application for research students. This application will take DNA methylation data as input. This app will make it easier for the students to look into their generated data, and it will help them with understanding their data.
Loading in the data¶
21-11-2024¶
I would like to combine the data from all the files into one single file, with the id in the column of the df. This way i could compare different conditions to eachother.
The first code-block is to load in the used libraries.
import os
import seaborn as sns
import matplotlib.pyplot as plt
import polars as pl
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf
from Bio import SeqIO
import hvplot.polars
import geneview
import pandas as pd
import re
import plotly.express as px
barcodes_names: pl.dataframe = pl.read_csv("/commons/Themas/Thema06/Methylatie/barcodes.csv")
barcodes_names = barcodes_names.with_columns(controle_n = pl.int_range(pl.len()).over(" description")+1)
barcodes_names = barcodes_names.with_columns(group_and_n = pl.concat_str([pl.col(' description'), pl.col("controle_n")]))
barcodes_names = barcodes_names.with_columns(pl.col(pl.Utf8).str.strip_chars()).drop("controle_n")
print(barcodes_names)
shape: (5, 3) āāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāāāā ā barcode ā description ā group_and_n ā ā --- ā --- ā --- ā ā i64 ā str ā str ā āāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāāāā” ā 11 ā Jurkat_DMSO_control ā Jurkat_DMSO_control1 ā ā 12 ā Jurkat_betuline ā Jurkat_betuline1 ā ā 13 ā Healthy_control ā Healthy_control1 ā ā 14 ā Jurkat_betuline ā Jurkat_betuline2 ā ā 15 ā Jurkat_DMSO_control ā Jurkat_DMSO_control2 ā āāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāāāā
This generates a data frame that contains the barcode and also the description of the barcode The column called group_and_n contains the description with a control group number
This is needed to label the different groups in the df that will contain all of the data
Which will be loaded in the code below this block
path: str = "/commons/Themas/Thema06/Methylatie/analysis"
def load_files(path: str) -> pl.dataframe:
resulting_df: pd.DataFrame = pl.DataFrame(
{"chr":[],
"start":[],
"end":[],
"frac":[],
"valid":[],
"group_name":[]}
)
files: list[str] = os.listdir(path)
for file in files:
if os.path.isfile(f"{path}/{file}") and file.endswith("methylatie_ALL.csv"):
temp_df: pd.DataFrame = pd.read_csv(f"{path}/{file}", sep="\t")
temp_df: pl.DataFrame = pl.from_pandas(temp_df)
barcode_num: list[int] = re.findall(r"\d+", file)
name_group: str = barcodes_names.filter(pl.col("barcode").cast(pl.String) == barcode_num[0]).select("group_and_n")
temp_df: pl.DataFrame = temp_df.with_columns(pl.lit(name_group).alias("group_name"))
resulting_df = pl.concat([temp_df, resulting_df])
return resulting_df
df: pl.DataFrame = load_files(path=path)
All of the csv files are now loaded into 1 polars dataframe
print(df.head())
shape: (5, 6) āāāāāāāā¬āāāāāāāā¬āāāāāāāā¬āāāāāāā¬āāāāāāāā¬āāāāāāāāāāāāāāāāāāā ā chr ā start ā end ā frac ā valid ā group_name ā ā --- ā --- ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā f64 ā i64 ā str ā āāāāāāāāŖāāāāāāāāŖāāāāāāāāŖāāāāāāāŖāāāāāāāāŖāāāāāāāāāāāāāāāāāāā” ā chr1 ā 10468 ā 10469 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā chr1 ā 10470 ā 10471 ā 1.0 ā 2 ā Jurkat_betuline2 ā ā chr1 ā 10488 ā 10489 ā 1.0 ā 2 ā Jurkat_betuline2 ā ā chr1 ā 10492 ā 10493 ā 1.0 ā 2 ā Jurkat_betuline2 ā ā chr1 ā 10496 ā 10497 ā 1.0 ā 1 ā Jurkat_betuline2 ā āāāāāāāā“āāāāāāāā“āāāāāāāā“āāāāāāā“āāāāāāāā“āāāāāāāāāāāāāāāāāāā
Data exploration¶
22-11-2024¶
I now have a data frame with the methylation data with a column called group_name that holds the name of the group of which the data comes from
test: pl.DataFrame = df.filter(
pl.col("group_name").is_in(['Healthy_control1', 'Jurkat_betuline1', 'Jurkat_betuline2'])
)
test: pl.DataFrame = df.filter((pl.col("start") >= 60778131) &
(pl.col("end") <= 60778731) &
(pl.col("chr") == "chr10"))
print(test)
shape: (83, 6) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāā¬āāāāāāāā¬āāāāāāāāāāāāāāāāāāā ā chr ā start ā end ā frac ā valid ā group_name ā ā --- ā --- ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā f64 ā i64 ā str ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāŖāāāāāāāāŖāāāāāāāāāāāāāāāāāāā” ā chr10 ā 60778212 ā 60778213 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778217 ā 60778218 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778237 ā 60778238 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778258 ā 60778259 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778283 ā 60778284 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā ⦠ā ⦠ā ⦠ā ⦠ā ⦠ā ⦠ā ā chr10 ā 60778682 ā 60778683 ā 0.0 ā 3 ā Healthy_control1 ā ā chr10 ā 60778689 ā 60778690 ā 0.0 ā 3 ā Healthy_control1 ā ā chr10 ā 60778700 ā 60778701 ā 0.0 ā 3 ā Healthy_control1 ā ā chr10 ā 60778707 ā 60778708 ā 0.0 ā 3 ā Healthy_control1 ā ā chr10 ā 60778723 ā 60778724 ā 0.0 ā 3 ā Healthy_control1 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāā“āāāāāāāā“āāāāāāāāāāāāāāāāāāā
This df contains all methylation data between the range of 60778131 and 60778731 on chr 10 (CDK1). This is a possible way to filter for all the methylated DNA in a range. This is also filtered for 3 selected groups.
all_groups = pl.DataFrame({"group_name": df["group_name"].unique()})
test_agg: pl.DataFrame = (
test
.select(["group_name", "frac"])
.group_by("group_name")
.agg([pl.len().alias("n methylations")])
.join(all_groups, on="group_name", how="full")
.with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
.drop("group_name_right")
.fill_null(0)
)
print(test_agg)
shape: (5, 2) āāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāā ā group_name ā n methylations ā ā --- ā --- ā ā str ā u32 ā āāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāā” ā Healthy_control1 ā 44 ā ā Jurkat_betuline2 ā 39 ā ā Jurkat_DMSO_control2 ā 0 ā ā Jurkat_DMSO_control1 ā 0 ā ā Jurkat_betuline1 ā 0 ā āāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāā
This df contains the number of methylations in the range specified in the test df. It appears that there's only methylations for the healthy control group and 1 of the 2 betuline control groups. These results overlap with the results found by the students for this gene (CDK1).
sns.set_theme()
sns.barplot(data = test_agg.sort("n methylations", descending=True),
y = "group_name", x = "n methylations",
hue="group_name", palette="Set2")
plt.show()
This plot showcases the amount of methylations for a certain gene (CDK1). It appears that there's only methylated DNA for 2 groups.
These results overlap with the research and processing of the data that the research students have done.
25-11-2024¶
- Extra md explanation for code above this block.
The following thing i would like to check is if the difference between end and start are always 1. This is to check if there is any possibly faulty data.
start_end_diff: pl.DataFrame = (
df
.filter(pl.col("end") - pl.col("start") != 1)
)
print(start_end_diff)
shape: (0, 6) āāāāāāā¬āāāāāāāā¬āāāāāā¬āāāāāāā¬āāāāāāāā¬āāāāāāāāāāāāā ā chr ā start ā end ā frac ā valid ā group_name ā ā --- ā --- ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā f64 ā i64 ā str ā āāāāāāāŖāāāāāāāāŖāāāāāāŖāāāāāāāŖāāāāāāāāŖāāāāāāāāāāāāā” āāāāāāā“āāāāāāāā“āāāāāā“āāāāāāā“āāāāāāāā“āāāāāāāāāāāāā
This ouput means that the difference between end and start is always equal to 1. This means that there are no faulty positions.
I would like to see if there are any major differences between the different groups and the amount of methylated DNA.
df_n_methylation: pl.DataFrame = (
df
.select("group_name")
.group_by("group_name")
.agg([pl.len().alias("n methylations")])
)
print(df_n_methylation.head())
shape: (5, 2) āāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāā ā group_name ā n methylations ā ā --- ā --- ā ā str ā u32 ā āāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāā” ā Healthy_control1 ā 14820163 ā ā Jurkat_DMSO_control1 ā 2541070 ā ā Jurkat_betuline2 ā 2654900 ā ā Jurkat_DMSO_control2 ā 2272843 ā ā Jurkat_betuline1 ā 1994070 ā āāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāā
This table holds the total amount of methylations. Visualising this table would make it easier to see any possible differences.
sns.barplot(data = df_n_methylation,
y = "group_name", x = "n methylations",
hue="group_name", palette="Set2",).set(
title="Amount of methylations for all groups",
xlabel="Amount of methylations", ylabel = "Group name")
plt.show()
This plot visualises the amount of methylations for every group that is part of the experiment. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.
This plot clearly showcases that the healthy control group has way more methylations then the other groups. This implies that the other groups might have some sort of effect on the methylation. It is unclear if this is the betuline, the DMSO control group also appears to impact the methylation.
I could possibly zoom more into to other groups, to visualise the differences between the treated groups.
sns.barplot(data = df_n_methylation.filter(pl.col("group_name") != "Healthy_control1"),
y = "group_name", x = "n methylations",
hue="group_name", palette="Set2",).set(
title="Amount of methylations for all treated groups",
xlabel="Amount of methylations", ylabel = "Group name")
plt.show()
This plot visualises the amount of methylations for every group that is part of the experiment. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.
This visualises that there does not appear to be a pattern between and inside of groups. To make data selection i need to do the following:
- Give the given bedfile gene symbols to easier look for specific genes
I will clean the bedfile first, it is very inconsistant with tabs and spaces.
BED file annotation¶
26-11-2024¶
path_bed = "/commons/Themas/Thema06/Methylatie/RRMS_human_hg38.bed"
file_cleaned = []
with open(path_bed, 'r') as bed_file:
for line in bed_file:
replaced_line = re.sub(r"\s+", "\t", line.strip())
file_cleaned.append(replaced_line)
with open("../data/new_bed_file.bed", "w") as new_bed:
new_bed.write("chr\tstart\tend\n")
new_bed.write("\n".join(file_cleaned))
bed_df = pl.from_pandas(pd.read_csv("../data/new_bed_file.bed", sep="\t"))
print(bed_df)
shape: (18_069, 3) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā ā chr ā start ā end ā ā --- ā --- ā --- ā ā str ā i64 ā i64 ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāā” ā chr1 ā 24735 ā 33737 ā ā chr1 ā 131124 ā 139563 ā ā chr1 ā 195251 ā 204121 ā ā chr1 ā 364792 ā 386185 ā ā chr1 ā 487107 ā 495546 ā ā ⦠ā ⦠ā ⦠ā ā chr18 ā 59898996 ā 59900196 ā ā chr19 ā 47220224 ā 47221024 ā ā chr11 ā 798884 ā 799484 ā ā chr10 ā 60778131 ā 60778731 ā ā chr17 ā 7667421 ā 7668621 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā
biomart_bed = pl.from_pandas(pd.read_csv("/homes/rreilman/Downloads/mart_export.txt", low_memory=False))
biomart_bed = biomart_bed.select(["Chromosome/scaffold name", "Gene start (bp)", "Gene end (bp)", "Gene name"])
biomart_bed = biomart_bed.rename({
"Chromosome/scaffold name": "chr",
"Gene start (bp)": "start",
"Gene end (bp)": "end",
"Gene name": "gene_name"
})
biomart_bed = biomart_bed.with_columns(
pl.col("gene_name").fill_null("unknown gene")
)
print(biomart_bed.head())
shape: (5, 4) āāāāāāā¬āāāāāāāā¬āāāāāāā¬āāāāāāāāāāāā ā chr ā start ā end ā gene_name ā ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā str ā āāāāāāāŖāāāāāāāāŖāāāāāāāŖāāāāāāāāāāāā” ā MT ā 577 ā 647 ā MT-TF ā ā MT ā 648 ā 1601 ā MT-RNR1 ā ā MT ā 1602 ā 1670 ā MT-TV ā ā MT ā 1671 ā 3229 ā MT-RNR2 ā ā MT ā 3230 ā 3304 ā MT-TL1 ā āāāāāāā“āāāāāāāā“āāāāāāā“āāāāāāāāāāāā
def annotate_bed(bed_df: pl.DataFrame, annotate_df: pl.DataFrame):
new_df = []
for promoter in bed_df.iter_rows():
chr_promoter, start_promoter, end_promoter = promoter
overlaps = (annotate_df
.filter(
(pl.col("chr") == chr_promoter.replace("chr", "")) &
(pl.col("start") <= end_promoter) &
(pl.col("end") >= start_promoter)
)
)
if not overlaps.is_empty():
for gene in overlaps["gene_name"].to_list():
new_df.append({"chr":chr_promoter,
"start":start_promoter,
"end":end_promoter,
"gene_name":gene})
else:
new_df.append({"chr":chr_promoter,
"start":start_promoter,
"end":end_promoter,
"gene_name":"Unknown gene"})
return pl.DataFrame(new_df)
#bed_new_df = annotate_bed(bed_df, biomart_bed)
The bed file should now contain the promotor locations with fitting genes. Lets test this by searching for a gene the students used for their research
#print(bed_new_df.filter(pl.col("gene_name").list.contains("CDK1")))
This result shows that there is no CDK1 found in the promoter bed file, which is false considering the students used it. Martijn told me about annotation from GBFF file, im going to look into that.
28-11-2024¶
i'm going to look at annotating my bed file via a GBFF file from NCBI.
gbff_file = "/homes/rreilman/jaar2/ncbi_dataset/data/GCF_000001405.26/genomic.gbff"
gene_information = []
for record in SeqIO.parse(gbff_file, "genbank"):
chr_name = record.id
for feature in record.features:
if feature.type == "gene":
start = int(feature.location.start)
end = int(feature.location.end)
gene_name = feature.qualifiers.get("gene", ["Unknown"])[0]
gene_information.append({"chr": chr_name,
"start":max(0, start-1000),
"end":end,
"gene_name":gene_name})
gbff_gene_df = pl.DataFrame(gene_information)
print(gbff_gene_df.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4) āāāāāāāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāāā ā chr ā start ā end ā gene_name ā ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā str ā āāāāāāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāā” ā NC_000010.11 ā 60771975 ā 60794852 ā CDK1 ā āāāāāāāāāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāāā
This df contains the chr start end and name of every gene in the GBFF file The chromosome naming convention does not match the way our bed file is made (chr*) so i will have to change that.
chrome_mapping = pl.from_pandas(pd.read_csv("/homes/rreilman/jaar2/chromosome_mapping.csv", delimiter="\t"))
print(chrome_mapping)
shape: (455, 2) āāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāā ā RefSeq seq accession ā Chromosome name ā ā --- ā --- ā ā str ā str ā āāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāā” ā NC_000001.11 ā 1 ā ā NC_000002.12 ā 2 ā ā NC_000003.12 ā 3 ā ā NC_000004.12 ā 4 ā ā NC_000005.10 ā 5 ā ā ⦠ā ⦠ā ā NT_187685.1 ā 19 ā ā NT_187686.1 ā 19 ā ā NT_187687.1 ā 19 ā ā NT_113949.2 ā 19 ā ā NC_012920.1 ā MT ā āāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāā
This df contains the NCBI naming convention and the way i have named my chromosomes.
gbff_gene_df_updated = gbff_gene_df.join(chrome_mapping, left_on="chr", right_on="RefSeq seq accession", how="left")
gbff_gene_df_updated = gbff_gene_df_updated.with_columns(
pl.when(pl.col("Chromosome name").is_not_null())
.then(pl.col("Chromosome name"))
.otherwise(pl.col("chr"))
.alias("chr")
).select(["chr", "start", "end", "gene_name"])
print(gbff_gene_df_updated.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4) āāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāāā ā chr ā start ā end ā gene_name ā ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā str ā āāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāā” ā 10 ā 60771975 ā 60794852 ā CDK1 ā āāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāāā
This dataframe now used our naming convention, instead of NCBI chromosome naming convention. The next step is to annotate out bed dataframe.
annotated_bed_file = annotate_bed(bed_df, gbff_gene_df_updated)
print(annotated_bed_file)
#annotated_bed_file.write_csv("../data/annotated_bed.bed")
shape: (30_681, 4) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāāāāāā ā chr ā start ā end ā gene_name ā ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā str ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāāāāā” ā chr1 ā 24735 ā 33737 ā WASH7P ā ā chr1 ā 24735 ā 33737 ā MIR1302-2 ā ā chr1 ā 24735 ā 33737 ā FAM138A ā ā chr1 ā 24735 ā 33737 ā LOC102724250 ā ā chr1 ā 24735 ā 33737 ā TRNAN-GUU ā ā ⦠ā ⦠ā ⦠ā ⦠ā ā chr19 ā 47220224 ā 47221024 ā BBC3 ā ā chr11 ā 798884 ā 799484 ā PANO ā ā chr11 ā 798884 ā 799484 ā PIDD ā ā chr10 ā 60778131 ā 60778731 ā CDK1 ā ā chr17 ā 7667421 ā 7668621 ā TP53 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāāāāāā
I will now create the same barplot as before, to check if the result is the same.
def get_gene_info(genes_list: list[str], annotated_bed_df):
df_wanted = (annotated_bed_df
.filter(pl.col("gene_name").is_in(genes_list)))
return df_wanted
df_cdk1 = get_gene_info(["CDK1"], annotated_bed_file)
print(df_cdk1)
shape: (2, 4) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāāā ā chr ā start ā end ā gene_name ā ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā str ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāā” ā chr10 ā 60774212 ā 60783104 ā CDK1 ā ā chr10 ā 60778131 ā 60778731 ā CDK1 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāāā
This dataframe contains the promoter areas for the CDK1 gene. I will use these to filter the df with the methylation data. And create a barplot
methylation_cdk1 = (df
.filter((pl.col("chr").is_in(df_cdk1.select("chr"))) &
(pl.col("start") >= df_cdk1.select("start").to_numpy()[0]) &
(pl.col("end") <= df_cdk1.select("end").to_numpy()[0])))
print(methylation_cdk1)
test_agg2: pl.DataFrame = (
methylation_cdk1
.select(["group_name", "frac"])
.group_by("group_name")
.agg([pl.len().alias("n methylations")])
.join(all_groups, on="group_name", how="full")
.with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
.drop("group_name_right")
.fill_null(0)
)
print(test_agg2.head())
sns.barplot(data = test_agg2.sort("n methylations", descending=True),
y = "group_name", x = "n methylations",
hue="group_name", palette="Set2")
plt.show()
shape: (248, 6) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāā¬āāāāāāāā¬āāāāāāāāāāāāāāāāāāā ā chr ā start ā end ā frac ā valid ā group_name ā ā --- ā --- ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā f64 ā i64 ā str ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāŖāāāāāāāāŖāāāāāāāāāāāāāāāāāāā” ā chr10 ā 60774589 ā 60774590 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60774731 ā 60774732 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60775086 ā 60775087 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60775117 ā 60775118 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60775225 ā 60775226 ā 1.0 ā 1 ā Jurkat_betuline2 ā ā ⦠ā ⦠ā ⦠ā ⦠ā ⦠ā ⦠ā ā chr10 ā 60780807 ā 60780808 ā 1.0 ā 2 ā Healthy_control1 ā ā chr10 ā 60780818 ā 60780819 ā 1.0 ā 2 ā Healthy_control1 ā ā chr10 ā 60781020 ā 60781021 ā 1.0 ā 3 ā Healthy_control1 ā ā chr10 ā 60781482 ā 60781483 ā 1.0 ā 4 ā Healthy_control1 ā ā chr10 ā 60781532 ā 60781533 ā 1.0 ā 6 ā Healthy_control1 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāā“āāāāāāāā“āāāāāāāāāāāāāāāāāāā shape: (5, 2) āāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāā ā group_name ā n methylations ā ā --- ā --- ā ā str ā u32 ā āāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāā” ā Healthy_control1 ā 125 ā ā Jurkat_betuline2 ā 123 ā ā Jurkat_DMSO_control2 ā 0 ā ā Jurkat_DMSO_control1 ā 0 ā ā Jurkat_betuline1 ā 0 ā āāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāā
This outputs a similar result to the first barplot The only difference is that the annotation added an extra promoter area that has been connected to the CDK1 gene. This is usable for now, but i'll have to ask Martijn if it is correct. The next steps will be cleaning up a bit of my code.
The way the main df is being filtered is currently incorrect, i will have to create a function for this to make it more usable
03-12-2024¶
I noticed that i copied code to select dataa, i am going to create a function for this, so that it'll be easier. This function will take a filtered promoter_bed_df with annotated gene named, filtered on the genes of interest. The function uses that dataframe to find all methylation data in those promoter regions.
def filter_data_from_bed(main_df, promoter_df):
final_subsetted_df: pd.DataFrame = pl.DataFrame(
{"chr":[],
"start":[],
"end":[],
"frac":[],
"valid":[],
"group_name":[]}
)
for row in promoter_df.iter_rows():
(chromosome, promoter_start, promoter_end, gene_name) = row
subsetted_df = main_df.filter(
(pl.col("chr") == chromosome) &
(pl.col("start") >= promoter_start) &
(pl.col("end") <= promoter_end)
)
final_subsetted_df = pl.concat([subsetted_df, final_subsetted_df])
return final_subsetted_df
cdk1_test_df = filter_data_from_bed(df, df_cdk1)
print(cdk1_test_df.head())
shape: (5, 6) āāāāāāāāā¬āāāāāāāāāāā¬āāāāāāāāāāā¬āāāāāāā¬āāāāāāāā¬āāāāāāāāāāāāāāāāāāā ā chr ā start ā end ā frac ā valid ā group_name ā ā --- ā --- ā --- ā --- ā --- ā --- ā ā str ā i64 ā i64 ā f64 ā i64 ā str ā āāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāāāāāŖāāāāāāāŖāāāāāāāāŖāāāāāāāāāāāāāāāāāāā” ā chr10 ā 60778212 ā 60778213 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778217 ā 60778218 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778237 ā 60778238 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778258 ā 60778259 ā 0.0 ā 1 ā Jurkat_betuline2 ā ā chr10 ā 60778283 ā 60778284 ā 0.0 ā 1 ā Jurkat_betuline2 ā āāāāāāāāā“āāāāāāāāāāā“āāāāāāāāāāā“āāāāāāā“āāāāāāāā“āāāāāāāāāāāāāāāāāāā
04-12-2024¶
To see if this works, i will create a function that can count the amount of methylation data.
def count_methylation_data(main_df):
return (main_df
.select(["group_name", "frac"])
.group_by("group_name")
.agg([pl.len().alias("n methylations")])
.join(all_groups, on="group_name", how="full")
.with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
.drop("group_name_right")
.fill_null(0)
)
cdk1_count_data = count_methylation_data(cdk1_test_df)
cdk1_count_data
| group_name | n methylations |
|---|---|
| str | u32 |
| "Healthy_control1" | 169 |
| "Jurkat_betuline2" | 162 |
| "Jurkat_DMSO_control2" | 0 |
| "Jurkat_DMSO_control1" | 0 |
| "Jurkat_betuline1" | 0 |
This table can be used as data for barplot, to showcase how many methylations every group has for the selected gene(s). I'm going to create a function that will plot this data into a bar plot
def code_count_data(data_df):
sns.barplot(data = data_df.sort("n methylations", descending=True),
y = "group_name", x = "n methylations",
hue="group_name", palette="Set2")
plt.show()
code_count_data(cdk1_count_data)
What this showcases is that there were multiple promoter sites found for the CDK1, sites that possibly weren't found by Martijn.
fig = px.bar(cdk1_count_data, y="n methylations", x="group_name", color="group_name",
title="Amount of methylations for the CDK1 gene")
fig.show()
09-12-2024¶
hvplots can be used to create super customisable interactive plots, which i will use for the dashboard.
def plot_barchart(df: pl.DataFrame):
barplot = df.hvplot.bar(x = "group_name", y="n methylations",
color = "group_name", cmap = "Category10", width = 900)
return barplot
plot_barchart(cdk1_count_data)
This function generates an interactive plot, perfect for a website.
10-12-2024¶
Now i have to see what other plots i can use to visualise The scatter plot might be usuable to plot the specific methylation points. This should not be a fast plot, with all of the data, so the user must filter on range, chr and group to speed it up.
def plot_scatter(df):
df = (df.
filter(pl.col("chr").is_in("chr"+chrome_mapping["Chromosome name"].unique())))
df = df.sample(fraction=0.02)
return df.hvplot.scatter(x = "start", y="chr", by="group_name", width = 900)
plot_scatter(df)